Detecting and Recovering from In-Core Hardware Faults
Abstract
Aggressive scaling of CMOS transistors has enabled extensive system integration and faster, more efficient systems. On the flip side, it has resulted in an increasing number of devices failing in shipped components in the field for a variety of reasons, including soft errors, wear-out failures, and infant mortality. The pervasiveness of the problem across a broad market demands low-cost, generic reliability solutions, precluding traditional solutions that employ excessive redundancy and piecemeal solutions that address only a few failure modes. This dissertation presents SWAT (SoftWare Anomaly Treatment), a low-cost resiliency solution that effectively handles hardware faults while incurring low cost during the common mode of fault-free operation. SWAT is based on two key observations about the design of resilient systems. First, only those hardware faults that affect software need to be handled. Second, since the common mode of operation is fault-free, fault-free execution should incur near-zero overheads. SWAT thus uses novel zero- to low-cost hardware and software monitors that watch for anomalous software behavior to detect hardware faults. SWAT then relies on hardware support for checkpointing and rollback recovery. For fault recovery in the presence of I/O, we find that existing software-level mechanisms for output buffering fall short. This dissertation therefore proposes a simple low-cost hardware buffer for output buffering and demonstrates that this strategy achieves high recoverability while incurring low overheads. Although not detailed in this dissertation, SWAT contains a comprehensive diagnosis procedure that is invoked in the rare event of a fault to isolate the root cause of the fault by distinguishing between software bugs, transient hardware faults, and permanent hardware faults.
Effectively, SWAT handles hardware faults uniformly as software bugs, amortizing the resiliency cost across both hardware and software reliability. The results in this dissertation show that the SWAT strategy is effective to detect and recover the ...
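The interplay of checkpointing, rollback, and output buffering described in the abstract can be illustrated with a small software toy model. This is only a sketch under stated assumptions, not SWAT's hardware design: the class `BufferedRecoverySketch`, the helper `run_interval`, and the `fault_detected` flag are hypothetical names introduced here, and the sketch assumes that any fault is detected before the end of the interval in which it occurs.

```python
from copy import deepcopy


class BufferedRecoverySketch:
    """Toy model (hypothetical): outputs are held back until the interval
    that produced them is known to be fault-free."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = deepcopy(state)  # last known-good state
        self._pending = []                  # outputs since the last checkpoint
        self.released = []                  # outputs visible to the outside world

    def emit(self, value):
        # Outputs are buffered rather than sent out immediately.
        self._pending.append(value)

    def commit(self):
        # No anomaly seen in this interval: release outputs, take a new checkpoint.
        self.released.extend(self._pending)
        self._pending.clear()
        self._checkpoint = deepcopy(self.state)

    def rollback(self):
        # Anomaly seen: drop unreleased outputs and restore the checkpoint.
        self._pending.clear()
        self.state = deepcopy(self._checkpoint)


def run_interval(system, inputs, fault_detected=False):
    """Run one checkpoint interval; `fault_detected` stands in for a monitor
    that flags anomalous software behavior (an assumption of this sketch)."""
    for x in inputs:
        system.state["total"] += x
        system.emit(system.state["total"])
    if fault_detected:
        system.rollback()
    else:
        system.commit()


system = BufferedRecoverySketch({"total": 0})
run_interval(system, [1, 2])                       # fault-free interval
run_interval(system, [10], fault_detected=True)    # faulty interval, rolled back
run_interval(system, [10])                         # re-execution after recovery
```

Because outputs from the faulty interval never leave the buffer, re-executing that interval after rollback does not produce duplicate or corrupted external output, which is the property output buffering is meant to provide.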
Similar resources
Towards a Software-Hardware Co-Designed Resilient System
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low overheads, precluding use of excessive redundancy. We explore a co-designed hardware-software solution that treats most hardware faults as software bugs and leverages common mechanisms for hardware and software rel...
Detection of power oscillation and simultaneous faults using Clark transform
Distance relays are widely used to protect transmission lines. Sometimes, because of power oscillations on these lines, the impedance calculated by the distance relay enters its operating zones and causes the lines to be disconnected. This issue can cause global power outages. Accordingly, in this paper, a Clark-based method for detecting the oscillation of power...
On Feasibility of Adaptive Level Hardware Evolution for Emergent Fault Tolerant Communication
A permanent physical fault in communication lines usually leads to a failure. This paper studies the feasibility of evolving a self-organized communication to overcome this problem. In this case a communication protocol may emerge between blocks and can also adapt itself to environmental changes such as physical faults and defects. In spite of faults, blocks may continue to function since ...
Fast Self-Recovering Controllers
A fast fault-tolerant controller structure is presented, which is capable of recovering from transient faults by performing a rollback operation in hardware. The proposed fault-tolerant controller structure utilizes the rollback hardware also for system mode and this way achieves performance improvements of more than 50% compared to controller structures made fault tolerant by conventional tech...